Management of heterogeneous workloads

ABSTRACT

Systems and methods for managing a system of heterogeneous workloads are provided. Work that enters the system is separated into a plurality of heterogeneous workloads. A plurality of high-level quality of service goals is gathered. At least one of the plurality of high-level quality of service goals corresponds to each of the plurality of heterogeneous workloads. A plurality of control functions are determined that are provided by virtualizations on one or more containers in which one or more of the plurality of heterogeneous workloads run. An expected utility of a plurality of settings of at least one of the plurality of control functions is determined in response to the plurality of high-level quality of service goals. At least one of the plurality of control functions is exercised in response to the expected utility to effect changes in the behavior of the system.

CROSS-REFERENCE TO RELATED APPLICATION

This application is a divisional of U.S. patent application Ser. No. 11/741,875 filed on Apr. 30, 2007, the disclosure of which is incorporated by reference herein in its entirety.

FIELD OF THE INVENTION

The present invention relates generally to application management, and more particularly, to methods and apparatus for management of heterogeneous workloads.

BACKGROUND OF THE INVENTION

Many organizations rely on a heterogeneous set of applications to deliver critical services to their customers and partners. This set of applications includes web workloads typically hosted on a collection of clustered application servers and a back-end tier database. The application mix also includes non-interactive workloads such as portfolio analysis, document indexing, and various types of scientific computations. To efficiently utilize the computing power of their datacenters, organizations allow these heterogeneous workloads to execute on the same set of hardware resources and need a resource management technology to determine the most effective allocation of resources to particular workloads.

A traditional approach to resource management for heterogeneous workloads is to configure resource allocation policies that govern the division of computing power among web and non-interactive workloads based on temporal or resource utilization conditions. With a temporal policy, the resource reservation for web workloads varies between peak and off-peak hours. Resource utilization policies allow a non-interactive workload to be executed when resource consumption by the web workload falls below a certain threshold. Typically, resource allocation is performed with a granularity of a full server machine, as it is difficult to configure and enforce policies that allow server machines to be shared among workloads. Coarse-grained resource management based on temporal or resource utilization policies has previously been automated. See K. Appleby et al., “Oceano—SLA-Based Management of a Computing Utility,” IFIP/IEEE Symposium on Integrated Network Management, Seattle, Wash., May 2001; and Y. Hamadi, “Continuous Resources Allocation in Internet Data Centers,” IEEE/ACM International Symposium on Cluster Computing and the Grid, Cardiff, UK, May 2005, pp. 566-573.

Once server machines are assigned to either the web or the non-interactive workload, existing resource management policies can be used to manage individual web and non-interactive applications. In the case of web workloads, these management techniques involve flow control and dynamic application placement. See C. Li et al., “Performance Guarantees for Cluster-Based Internet Services,” IEEE/ACM International Symposium on Cluster Computing and the Grid, Tokyo, Japan, May 2003; G. Pacifici et al., “Performance Management for Cluster-Based Web Services,” IEEE Journal on Selected Areas in Communications, Vol. 23, No. 12, December 2005; and A. Karve et al., “Dynamic Placement for Clustered Web Applications,” World Wide Web Conference, Edinburgh, Scotland, May 2006. In the case of non-interactive workloads, the techniques involve job scheduling, which may be performed based on various existing scheduling disciplines. See D. Feitelson et al., “Parallel Job Scheduling—a Status Report,” 10th Workshop on Job Scheduling Strategies for Parallel Processing, 2004, pp. 1-16. To effectively manage heterogeneous workloads, a solution is needed that combines flow control and dynamic placement techniques with job scheduling.

SUMMARY OF THE INVENTION

Embodiments of the present invention provide a system and method for management of heterogeneous workloads.

For example, in one aspect of the present invention, a method for managing a system of heterogeneous workloads is provided. Work that enters the system is separated into a plurality of heterogeneous workloads. A plurality of high-level quality of service goals is gathered. At least one of the plurality of high-level quality of service goals corresponds to each of the plurality of heterogeneous workloads. A plurality of control functions are determined that are provided by virtualizations on one or more containers in which one or more of the plurality of heterogeneous workloads run. An expected utility of a plurality of settings of at least one of the plurality of control functions is determined in response to the plurality of high-level quality of service goals. At least one of the plurality of control functions is exercised in response to the expected utility to effect changes in the behavior of the system.

In additional embodiments of the present invention, web applications are placed on one or more of a plurality of heterogeneous server machines through a placement controller driven by utility functions of allocated CPU demand. Web application requests are received at a request router. The web application requests are dispatched from the request router to one or more web applications on one or more of the plurality of heterogeneous server machines in accordance with a scheduling mechanism. The scheduling mechanism is dynamically adjusted in response to at least one of workload intensity and system configuration. Jobs are allocated to one or more of the plurality of heterogeneous server machines in accordance with placement decisions communicated to a job scheduler by the placement controller.

In another aspect of the present invention, a system for management of heterogeneous workloads comprises a plurality of heterogeneous server machines. The system further comprises a placement controller, driven by utility functions of allocated central processing unit (CPU) demand, which places web applications on one or more of the plurality of heterogeneous server machines. A request router receives and dispatches requests to one or more web applications on one or more of the plurality of heterogeneous server machines in accordance with a scheduling mechanism. A flow controller in communication with the request router and the placement controller dynamically adjusts the scheduling mechanism in response to at least one of workload intensity and system configuration. A job scheduler allocates jobs to one or more of the plurality of heterogeneous server machines in accordance with placement decisions communicated to the job scheduler by the placement controller.

In additional embodiments of the present invention, the system may also comprise a web application workload profiler that obtains profiles for web application requests in the form of an average number of CPU cycles consumed by requests of a given flow, and provides the profiles for web application requests to the flow controller and placement controller. The system may also comprise a job workload profiler that obtains profiles for jobs in the form of at least one of the number of CPU cycles required to complete the job, the number of threads used by the job, and the maximum CPU speed at which the job may progress, and provides the profiles for jobs to the job scheduler.

In further embodiments of the present invention, the system may comprise a plurality of domains in each of the plurality of heterogeneous server machines. A first domain of the plurality of domains may comprise a node agent in communication with the placement controller and the job scheduler. A second domain of the plurality of domains may comprise a machine agent in communication with the node agent that manages virtual machines inside a given heterogeneous server machine. The node agent provides job management functionality within the heterogeneous server machine through interaction with the machine agent. The machine agent is capable of at least one of creating and configuring a virtual machine image for a new domain, copying files from the second domain to another domain, starting a process in another domain, and controlling the mapping of physical resources to virtual resources.

A system is provided that considerably improves the way heterogeneous workloads are managed on a set of heterogeneous server machines using automation mechanisms provided by server virtualization technologies. The system introduces several novel features. First, it allows heterogeneous workloads to be collocated on any server machine, thus reducing the granularity of resource allocation. This is an important aspect for many organizations that rely on a small set of powerful machines to deliver their services, as it allows for a more effective resource allocation when any workload requires a fractional machine allocation to meet its goals. Second, the approach uses high-level performance goals, as opposed to lower-level resource requirements, to drive resource allocation. Hence, unlike previous techniques, which manage virtual machines according to their defined resource requirements, an embodiment of the present invention provides an application-centric view of the system in which a virtual machine is only a tool used to achieve performance objectives. Third, an embodiment of the present invention exploits a range of new automation mechanisms that will also benefit a system with a homogeneous, particularly non-interactive, workload by allowing more effective scheduling of jobs.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow diagram illustrating a management methodology for a system of heterogeneous workloads, according to an embodiment of the present invention;

FIG. 2 is a diagram illustrating management system architecture for heterogeneous workloads, according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating management architecture for Xen machines, according to an embodiment of the present invention;

FIG. 4 is a diagram illustrating a life-cycle of a Xen domain, according to an embodiment of the present invention;

FIG. 5 is a flow diagram illustrating a management methodology for a system of heterogeneous workloads for the system architecture of FIG. 2, according to an embodiment of the present invention;

FIG. 6 is a table illustrating jobs used in experiments, according to an embodiment of the present invention;

FIG. 7 is a table illustrating runtime of virtual machine operations for various contained jobs, according to an embodiment of the present invention;

FIG. 8 is a diagram illustrating response time for a web-based transactional test application and job placement on nodes, according to an embodiment of the present invention;

FIG. 9 is a diagram illustrating node utilization by long running jobs, according to an embodiment of the present invention;

FIG. 10 is a graph illustrating a percentage of jobs that have not met their completion time goal, according to an embodiment of the present invention;

FIG. 11 is a graph illustrating suspend operations, according to an embodiment of the present invention;

FIG. 12 is a graph illustrating a sum of migrations and move-and-restore actions, according to an embodiment of the present invention; and

FIG. 13 is a block diagram of an illustrative hardware implementation of a computing system in accordance with which one or more components/methodologies of the invention may be implemented, according to an embodiment of the present invention.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

Integrated automated management of heterogeneous workloads is a challenging problem for several reasons. First, performance goals for different workloads tend to be of different types. For interactive workloads, goals are typically defined in terms of average or percentile response time or throughput over a certain time interval, while performance goals for non-interactive workloads concern the performance of individual jobs. Second, the time scale of management is different. Due to the nature of their performance goals and the short duration of individual requests, interactive workloads lend themselves to automation at short control cycles. Non-interactive workloads typically require calculation of a schedule for an extended period of time. Extending the time scale of management requires long-term forecasting of workload intensity and job arrivals, which is a difficult if not impossible problem to solve. Server virtualization assists in avoiding this issue by providing automation mechanisms by which resource allocation may be continuously adjusted to the changing environment. To collocate applications on a physical resource, one must know the applications' behavior with respect to resource usage and be able to enforce a particular resource allocation decision. For web applications, with the help of an L7 gateway, one can rather easily observe workload characteristics and, taking advantage of the similarity of web requests and their large number, derive reasonably accurate short-term predictions regarding the behavior of future requests. Non-interactive jobs do not exhibit the same self-similarity and abundance properties; hence predicting their behavior is much harder. Enforcing a resource allocation decision for web workloads can also be achieved relatively easily by using a flow control mechanism. Server virtualization provides similar enforcement mechanisms for non-interactive applications.

While server virtualization allows for better management of workloads to their respective SLA goals, using it effectively introduces considerable challenges. These concern the configuration and maintenance of virtual images, the infrastructure required to make effective use of the available automation mechanisms, and the development of algorithmic techniques capable of utilizing the larger number of degrees of freedom introduced by virtualization technologies. Embodiments of the present invention address some of these challenges.

Referring initially to FIG. 1, a flow diagram illustrates a management methodology for a system of heterogeneous workloads, according to an embodiment of the present invention. The methodology begins in block 102, where work that enters the system is separated into a plurality of heterogeneous workloads. In block 104, a plurality of high-level quality of service goals are gathered. At least one of the plurality of high-level quality of service goals corresponds to each of the plurality of heterogeneous workloads. In block 106, a plurality of control functions are determined that are provided by virtualizations on one or more containers in which one or more of the plurality of heterogeneous workloads run. In block 108, an expected utility of a plurality of settings of at least one of the plurality of control functions is determined in response to the plurality of high-level quality of service goals. In block 110, at least one of the plurality of control functions is exercised in response to the expected utility to effect changes in the behavior of the system.

Referring now to FIG. 2, a diagram illustrates management system architecture, according to an embodiment of the present invention. This system architecture represents one specific example of a management system; a plurality of different system architectures that perform the methodology of the present invention as illustrated in FIG. 1 are also possible. The managed system includes a set of heterogeneous server machines, referred to henceforth as node 1 202, node 2 204 and node 3 206. Web applications, app A 208 and app B 210, which are served by application servers, are replicated across nodes to form application server clusters. Requests to these applications arrive at an entry request router 212, which may be either an L4 or L7 gateway that distributes requests to clustered applications 208, 210 according to a load balancing mechanism. Long-running jobs are submitted to a job scheduler 214, placed in its queue, and dispatched from the queue based on the resource allocation decisions of the management system.

The management architecture of FIG. 2 takes advantage of an overload protection mechanism that can prevent a web application from utilizing more than the allocated amount of resources. Such overload protection may be achieved using various mechanisms including admission control or OS scheduling techniques. Server virtualization mechanisms could also be applied to enforce resource allocation decisions on interactive applications.

In the system considered, overload protection for interactive workloads is provided by an L7 request router 212 which implements a flow control technique. Router 212 classifies incoming requests into flows depending on their target application and service class, and places them in per-flow queues. Requests are dispatched from the queues based on a weighted-fair scheduling discipline, which observes a system-wide concurrency limit. The concurrency limit ensures that all the flows combined do not use more than their allocated resource share. The weights further divide the allocated resource share among applications and flows.
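The dispatch rule described above can be sketched as follows. This is a minimal illustration, not the router's actual implementation: the class name `FlowRouter`, its methods, and the tie-breaking rule (serve the backlogged flow with the lowest ratio of in-flight requests to weight) are assumptions introduced here.

```python
from collections import deque


class FlowRouter:
    """Per-flow queues served by a weighted-fair rule under a concurrency limit.

    Hypothetical sketch: names and the exact fairness rule are assumptions.
    """

    def __init__(self, weights, concurrency_limit):
        self.weights = dict(weights)          # flow name -> scheduling weight
        self.limit = concurrency_limit        # system-wide cap on in-flight requests
        self.queues = {flow: deque() for flow in self.weights}
        self.in_flight = {flow: 0 for flow in self.weights}

    def enqueue(self, flow, request):
        """Place a classified request in the queue of its flow."""
        self.queues[flow].append(request)

    def dispatch(self):
        """Dispatch queued requests until the concurrency limit is reached."""
        dispatched = []
        while sum(self.in_flight.values()) < self.limit:
            backlogged = [f for f in self.queues if self.queues[f]]
            if not backlogged:
                break
            # Serve the backlogged flow that is furthest below its weighted share.
            flow = min(backlogged, key=lambda f: self.in_flight[f] / self.weights[f])
            dispatched.append((flow, self.queues[flow].popleft()))
            self.in_flight[flow] += 1
        return dispatched

    def complete(self, flow):
        """Record that a request of the given flow has finished."""
        self.in_flight[flow] -= 1


router = FlowRouter({"app A": 3, "app B": 1}, concurrency_limit=4)
for i in range(10):
    router.enqueue("app A", f"a{i}")
    router.enqueue("app B", f"b{i}")
batch = router.dispatch()
```

With weights 3 and 1 and a limit of 4, the first batch contains three requests for app A and one for app B; further requests wait in their queues until `complete` frees a slot.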

Both the concurrency limit and scheduling weights are dynamically adjusted by a flow controller 216 in response to changing workload intensity and system configuration. Flow controller 216 builds a model of the system that allows it to predict the performance of the flow for any choice of concurrency limit and weights via optimizer 218. This model may also be used to predict workload performance for a particular allocation of CPU power. The functionality of flow controller 216 is used to derive a utility function for each web application at utility function calculator 220, which gives a measure of application happiness with a particular allocation of CPU power given its current workload intensity and performance goal.

Long-running jobs are submitted to the system via job scheduler 214, which, unlike traditional schedulers, does not make job execution and placement decisions. In the system, job scheduler 214 only manages dependencies among jobs and performs resource matchmaking. Once dependencies are resolved and a set of eligible nodes is determined, jobs are submitted to an application placement controller (APC) 222 via a job queue manager 224.

Each job has an associated performance goal. An embodiment of the present invention supports completion time goals, but the system may be extended to handle other performance objectives. From this completion time goal an objective function is derived which is a function of actual job completion time. When a job completes exactly on schedule, the value of the objective function is zero. Otherwise, the value increases or decreases linearly depending on the distance of completion time from the goal.
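As a minimal sketch of such an objective function: the text fixes only the zero at the goal and the linear shape, so the sign convention (positive when the job finishes early) and the `slope` parameter below are assumptions made for illustration.

```python
def completion_objective(completion_time, goal_time, slope=1.0):
    """Linear objective of actual completion time.

    Zero when the job completes exactly at its goal. Assumed convention:
    positive for early completion, negative for late completion.
    """
    return slope * (goal_time - completion_time)


on_time = completion_objective(100.0, goal_time=100.0)
early = completion_objective(90.0, goal_time=100.0)
late = completion_objective(120.0, goal_time=100.0, slope=0.5)
```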

Job scheduler 214 uses APC 222 as an adviser on where and when a job should be executed. When APC 222 makes a placement decision, actions pertaining to long-running jobs are returned to job scheduler 214 and put into effect via a job executor component 226. Job executor 226 monitors job status and makes it available to APC 222 for use in subsequent control cycles.

APC 222 provides the decision-making logic that affects placement of both web and non-interactive workloads. To learn about jobs in the system and their current status, APC 222 interacts with job scheduler 214 via a job scheduler proxy 228. A placement optimizer 230 calculates the placement that maximizes the minimum utility across all applications. It is able to allocate CPU and memory to applications based on their CPU and memory requirements, where the memory requirement of an application instance is assumed not to depend on the intensity of workload that reaches the instance. The optimization algorithm of APC 222 is improved; its inputs are modified from application CPU demand to a per-application utility function of allocated CPU speed, and the optimization objective is changed from maximizing the total satisfied CPU demand to maximizing the minimum utility across all applications. A web application placement executor 232 places applications on nodes 202, 204, 206 in an optimized manner.
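The max-min objective can be illustrated on a deliberately simplified problem: a single pool of CPU capacity divided among applications, ignoring memory and node boundaries. The sketch below is not the placement algorithm of APC 222; it uses bisection on the target utility level, and the function names and the example utility functions are assumptions.

```python
def cpu_needed(utility, target, capacity, tol=1e-6):
    """Smallest CPU amount in [0, capacity] at which a non-decreasing
    utility function reaches the target, or None if it never does."""
    if utility(capacity) < target:
        return None
    lo, hi = 0.0, capacity
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if utility(mid) >= target:
            hi = mid
        else:
            lo = mid
    return hi


def max_min_allocation(utilities, capacity, u_lo=-1.0, u_hi=1.0, tol=1e-6):
    """Bisect on the utility level that every application can reach at once."""
    best = None
    while u_hi - u_lo > tol:
        target = (u_lo + u_hi) / 2
        needs = [cpu_needed(u, target, capacity) for u in utilities.values()]
        if None not in needs and sum(needs) <= capacity:
            best = dict(zip(utilities, needs))   # feasible: try a higher level
            u_lo = target
        else:
            u_hi = target                        # infeasible: lower the level
    return best


# Hypothetical utilities: -1 with no CPU, 0 once the full demand (in MHz) is met.
utilities = {
    "web": lambda cpu: min(cpu, 3000.0) / 3000.0 - 1.0,
    "batch": lambda cpu: min(cpu, 1000.0) / 1000.0 - 1.0,
}
allocation = max_min_allocation(utilities, capacity=2000.0)
```

In this example the two applications end up at the same utility, with roughly 1500 MHz for the web application and 500 MHz for the batch workload, rather than one demand being fully satisfied at the expense of the other.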

Since APC 222 is driven by utility functions of allocated CPU demand and, for non-interactive workloads, only objective functions of achieved completion times are given, a way to map completion time into CPU demand, and vice versa, may also be provided. Recall that for web traffic a similar mechanism exists, provided by the flow controller. The required mapping is very difficult to obtain for non-interactive workloads, because the performance of a given job is not independent of CPU allocation to other jobs. After all, when not all jobs can simultaneously run in the system, the completion time of a job that is waiting in the queue for other jobs to complete before it may be started depends on how quickly the jobs that were started ahead of it complete; hence it depends on the CPU allocation to other jobs. In the system, simple but effective heuristics are implemented that allow aggregate CPU requirements to be estimated for all long-running jobs for a given value of utility function at job utility estimator 234. This estimation is used to obtain a set of data points from which the utility function, as well as the values needed to solve the optimization problem, are later extrapolated.

To manage web and non-interactive workloads, APC relies on the knowledge of resource consumption by individual requests and jobs. The system includes profilers for both kinds of workloads. A web workload profiler 236 obtains profiles for web requests in the form of the average number of CPU cycles consumed by requests of a given flow. A job workload profiler 238 obtains profiles for jobs in the form of the number of CPU cycles required to complete the job, the number of threads used by the job, and the maximum CPU speed at which the job may progress.

The virtualization features that the system in an embodiment of the present invention can exploit are briefly enumerated below.

-   PAUSE: When a virtual machine is paused, it does not receive any time on the node's processors, but the virtual machine remains in memory.
-   RESUME: Resumption is the opposite of pausing; the virtual machine is once again allocated execution time on the node's processors.
-   SUSPEND: When a virtual machine is suspended, its memory image is saved to disk, and it is unloaded.
-   RESTORE: Restoration is the opposite of suspension; an image of the virtual machine's memory is loaded from disk, and the virtual machine is permitted to run again.
-   MIGRATE: Migration is the process of moving a virtual machine from one node to another. In standard migration, the virtual machine is first paused, then the memory image is transferred across the network to the target node, and the virtual machine is resumed.
-   LIVE_MIGRATE: Live migration is a version of migration in which the virtual machine is not paused. Instead, the memory image is transferred over the network while the virtual machine runs, and (when the memory images on both nodes match) control passes to the new host.
-   MOVE_AND_RESTORE: When a virtual machine has already been suspended and needs to be restored on a different node, the management system must first move the saved memory image to the new node, and then restore the virtual machine on the new host node.
-   RESOURCE_CONTROL: Resource control modifies the amounts of various resources that virtual machines can consume, such as, for example, CPU and memory.

While virtualization can be provided using various technologies, an embodiment of the present invention uses Xen as it is capable of providing the wide variety of controls discussed above. Xen is an x86 virtual machine monitor capable of running multiple commodity operating systems on shared hardware. Although it typically requires that guest operating systems be modified, user-level code can execute in guest VMs, called domains, without modification. Xen provides a series of controls, including those discussed above. All of these controls are most directly accessible from a special domain on each Xen-enabled node, labeled domain 0.

The system relies on an entry gateway that provides flow control for web requests. The entry gateway provides a type of high-level virtualization for web requests by dividing CPU capacity of managed nodes among competing flows. Together with an overload protection mechanism, the entry gateway facilitates performance isolation for web applications.

Server virtualization could also be used to provide performance isolation for web applications. This would come with a memory overhead caused by additional copies of the OS that would have to be present on the node. Hence, it is believed that middleware virtualization technology is a better choice for managing the performance of web workloads.

Since middleware virtualization technology can only work for applications whose request flow it can control, a lower-level mechanism must be used to provide performance isolation for other types of applications. As outlined in the previous section, server virtualization provides powerful mechanisms to control resource allocation of non-web applications.

Referring now to FIG. 3, a diagram illustrates management architecture for Xen machines, according to an embodiment of the present invention. To manage virtual machines (VMs) inside a physical Xen-enabled node, a component has been implemented, called a machine agent 302, which resides in domain 0 of a given node so as to have access to the Xen domain controls. Machine agent 302 provides a Java-based interface to create and configure a VM image for a new domain, copy files from domain 0 to another domain, start a process in another domain, and control the mapping of physical resources to virtual resources.

Xen is used to provide on-line automation for resource management; hence it is desirable to make management actions light-weight and efficient. This consideration concerns the process of creating virtual images, which may be quite time consuming. The substantial delays that would otherwise be incurred each time a job is to be started from job scheduler 314 are avoided by pre-creating a set of images, in accordance with a command executor 304 and a repository 306, for use during runtime. The dispensing of these pre-created images is performed by image management subsystem 308. Images once used to run a process are scrubbed of that process data and may be reused by future processes. In small-scale testing thus far, it has been found sufficient to pre-create a small number of images; however, image management subsystem 308 may be extended to dynamically extend the pool of available images if needed.
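The dispensing, scrubbing, and reuse of pre-created images can be sketched as a simple pool. This is an illustrative model only: the class `ImagePool`, its methods, and the image naming are hypothetical, and the on-demand growth corresponds to the possible extension mentioned above rather than to the tested system.

```python
class ImagePool:
    """Hands out pre-created VM images and takes them back for reuse."""

    def __init__(self, precreated):
        self.free = list(precreated)     # images ready to be dispensed
        self.in_use = {}                 # image name -> files of the process in it
        self._created = len(self.free)

    def acquire(self):
        """Return a free image, creating a new one only if the pool is empty."""
        if not self.free:
            self._created += 1
            self.free.append(f"image-{self._created}")   # assumed extension
        image = self.free.pop()
        self.in_use[image] = []
        return image

    def populate(self, image, files):
        """Record the files copied into the image for the new process."""
        self.in_use[image].extend(files)

    def release(self, image):
        """Scrub the process data and return the image to the pool."""
        del self.in_use[image]
        self.free.append(image)


pool = ImagePool(["image-1", "image-2"])
first = pool.acquire()
second = pool.acquire()
third = pool.acquire()            # pool exhausted, so a new image is created
pool.populate(first, ["job.bin"])
pool.release(first)
reused = pool.acquire()
```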

Inside a created image, a new process may be created. This is done by populating the image with the files necessary to run that new process. In the system, it is assumed that the files required for all processes that may run on the node are placed in its domain 0 in advance. Hence, they only need to be copied from domain 0 to the created image. Mechanisms also exist that would allow files to be transferred from an external repository to a node where the process is intended to run.

Before it may be booted, an image must be provided with configuration files to set up its devices and networking. This functionality is encapsulated by a configuration management subsystem 310. To assign an IP address and DNS name, a DHCP server can be used, although in the system a simpler, more restrictive module has been implemented that selects configuration settings from a pool of available values.

Referring now to FIG. 4, a diagram illustrates a life-cycle of a Xen domain, according to an embodiment of the present invention. An image, once configured, may then be booted. Once in the running state, it may be suspended or paused. New processes may be created and run inside it. An image that is either running or paused may also be resource controlled. Migration may be used to transfer the image to another node. A suspend-move-and-restore mechanism has been implemented by which the domain is suspended on one machine, the checkpoint and image files are copied to another node, and the domain is restored on the new host node. This allows the benefits of migration to be studied.
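The life-cycle described above can be modeled as a small state machine. The transition table below is an assumption reconstructed from the prose and the list of virtualization controls; FIG. 4 itself may contain states or transitions not captured here.

```python
# (current state, action) -> next state; assumed from the description, not from FIG. 4.
TRANSITIONS = {
    ("configured", "boot"): "running",
    ("running", "pause"): "paused",
    ("paused", "resume"): "running",
    ("running", "suspend"): "suspended",
    ("suspended", "restore"): "running",
    ("suspended", "move_and_restore"): "running",   # restored on a different node
    ("running", "migrate"): "running",              # running again on the target node
    ("running", "resource_control"): "running",
    ("paused", "resource_control"): "paused",
}


def next_state(state, action):
    """Return the state reached by applying an action, or raise ValueError."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"{action!r} is not allowed in state {state!r}") from None


def run(actions, state="configured"):
    """Apply a sequence of actions starting from a configured image."""
    for action in actions:
        state = next_state(state, action)
    return state


final = run(["boot", "pause", "resource_control", "resume", "suspend", "move_and_restore"])
```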

Referring again to FIG. 3, Xen provides resource control mechanisms to manage memory and CPU usage by its domains. Memory is set for a domain based on configured or profiled memory requirements. CPU allocation is set for a domain based on autonomic decisions of APC 322, which result from its optimization technique. The CPU allocation to a domain may be lower than the amount of CPU power actually required by a process running inside a domain. Both memory and CPU allocations to a domain may change while the domain is running based on changing process requirements and decisions of APC 322.

CPU allocation to domains may be controlled in Xen using three mechanisms. First, the number of virtual CPUs (vCPUs) can be selected for any VM. Typically, the number of vCPUs depends on the parallelism level of a process that executes inside a domain. Second, vCPUs may be mapped to physical CPUs at a virtual-to-physical resource mapper 312. By ‘pinning’ vCPUs of a domain to different physical CPUs the performance of the domain may be improved. Finally, CPU time slices may be configured for each domain. When all vCPUs of a domain are mapped to different physical CPUs, allocation of 50 out of 100 time slices to the domain implies that each vCPU of the domain will receive 50% of the compute power of the physical CPU to which it is mapped. Xen also permits borrowing, by which CPU slices allocated to a domain that does not need them can instead be used by other domains.
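The time-slice arithmetic can be made concrete with a short calculation. The sketch assumes that every vCPU is pinned to a different physical CPU, that all physical CPUs run at the same speed, and that borrowing is ignored; the function name and the MHz figures are illustrative.

```python
def domain_cpu_power(vcpus, slices, cpu_mhz, total_slices=100):
    """Return (power per vCPU, total power of the domain) in MHz.

    Assumes each vCPU is pinned to its own physical CPU of speed cpu_mhz
    and that no slices are borrowed from other domains.
    """
    per_vcpu = cpu_mhz * slices / total_slices
    return per_vcpu, per_vcpu * vcpus


# 50 of 100 slices on 3000 MHz CPUs: each vCPU gets half of its physical CPU.
per_vcpu, total = domain_cpu_power(vcpus=2, slices=50, cpu_mhz=3000.0)
```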

In a default configuration provided by Xen, each domain receives the same number of vCPUs as there are physical CPUs on a machine. Each of those vCPUs will be mapped to a different physical CPU and receives 0 time slices with CPU borrowing turned on. In the process of managing the system this allocation is modified inside virtual-to-physical resource mapper 312. When a domain is first started, Xen is allowed to create the default number of vCPUs and map them to different physical CPUs. Only the number of time slices is set to obtain the CPU allocation requested by the placement controller. While the domain is running, its actual CPU usage is observed. If it turns out that the domain is not able to utilize all the vCPUs it has been given, it may be concluded that the job is not multi-threaded. Hence, to receive its allocated CPU share, its vCPUs must be appropriately reduced and remapped. Virtual-to-physical resource mapper 312 must attempt to find a mapping that provides the domain with the required amount of CPU power spread across the number of vCPUs that the job in the domain can use; clearly, this is not always possible.
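A hedged sketch of the remapping decision follows. It is not the logic of virtual-to-physical resource mapper 312; it only shows why reducing the vCPU count matters and when no mapping exists. The function `remap` and its parameters are hypothetical, and identical physical CPUs are assumed.

```python
import math


def remap(required_mhz, usable_vcpus, physical_cpus, cpu_mhz, total_slices=100):
    """Choose a vCPU count and time slices that deliver required_mhz.

    Returns (vcpus, slices), or None when the job cannot use enough vCPUs
    to absorb the requested CPU power.
    """
    vcpus = min(usable_vcpus, physical_cpus)
    per_vcpu = required_mhz / vcpus
    if per_vcpu > cpu_mhz:
        return None                      # a vCPU cannot exceed one physical CPU
    slices = math.ceil(total_slices * per_vcpu / cpu_mhz)
    return vcpus, slices


# A single-threaded job granted 3000 MHz on a 4-CPU, 3000 MHz machine: with the
# default 4 vCPUs at 25 slices it could use only 750 MHz, so it is given
# one vCPU with all 100 slices instead.
single = remap(3000.0, usable_vcpus=1, physical_cpus=4, cpu_mhz=3000.0)
dual = remap(3000.0, usable_vcpus=2, physical_cpus=4, cpu_mhz=3000.0)
impossible = remap(4000.0, usable_vcpus=1, physical_cpus=4, cpu_mhz=3000.0)
```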

All the VM actions provided by the machine agent are asynchronous JMX calls. They are followed by JMX notifications indicating the completion of an action.

To hide the usage of VMs from a user, a higher layer of abstraction is implemented, embedded inside a node agent 316, which provides the job management functionality. It provides operations to start, pause, resume, suspend, restore, and resource control a job. To implement these operations, the node agent interacts with command executor 304 of machine agent 302 in domain 0 using its VM management interfaces. When a job is first started, node agent 316 creates (or obtains a pre-created) image in which to run the job. It records the mapping between the job ID and VM ID in a VM ID repository 318. Then it asks machine agent 302 to copy corresponding process binaries to the new image and to boot the image. Once the domain is running, the job is started inside it.
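The job start sequence can be sketched as follows. The real agents communicate through asynchronous JMX calls; this illustration uses plain synchronous Python objects instead, and every class and method name in it is hypothetical.

```python
class RecordingMachineAgent:
    """Stand-in for the machine agent; it only records the calls it receives."""

    def __init__(self):
        self.calls = []
        self._next_vm = 1

    def obtain_image(self):
        vm_id = f"vm-{self._next_vm}"
        self._next_vm += 1
        self.calls.append(("obtain_image", vm_id))
        return vm_id

    def copy_files(self, vm_id, files):
        self.calls.append(("copy_files", vm_id))

    def boot(self, vm_id):
        self.calls.append(("boot", vm_id))

    def start_process(self, vm_id, command):
        self.calls.append(("start_process", vm_id))


class NodeAgent:
    """Job-level operations layered on top of the VM-level machine agent."""

    def __init__(self, machine_agent):
        self.machine_agent = machine_agent
        self.vm_ids = {}                 # job ID -> VM ID repository

    def start_job(self, job_id, files, command):
        vm_id = self.machine_agent.obtain_image()        # create or reuse an image
        self.vm_ids[job_id] = vm_id                      # record job ID -> VM ID
        self.machine_agent.copy_files(vm_id, files)      # copy process binaries
        self.machine_agent.boot(vm_id)                   # boot the image
        self.machine_agent.start_process(vm_id, command) # start the job inside it
        return vm_id


machine_agent = RecordingMachineAgent()
node_agent = NodeAgent(machine_agent)
vm = node_agent.start_job("job-42", ["job.bin"], "./job.bin")
```

Keeping the job ID to VM ID mapping inside the node agent is what lets callers issue later operations (pause, suspend, resource control) by job ID alone.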

Observe that a job is placed in its own domain. This provides performance isolation among jobs such that their individual resource usage is controlled, but it comes at the expense of added memory overhead. The system may be extended such that it allows collocation of multiple jobs inside a single domain based on some policies.

The node agent process is placed in domain 1, which is the domain used for all web applications. There are two reasons for placing the node agent in a domain separate from domain 0. First, the application server middleware already provides a node agent process with all required management support; thus adding new functionality is a matter of a simple plug-in. Second, domain 0 is intended to remain small and light-weight. Hence, using it to run functionality that does not directly invoke VM management tools is avoided.

Like the machine agent, the node agent exposes its API using JMX.

FIG. 3 shows the organization of a Xen-enabled server machine used in the system. At least two domains are run: domain 0 with machine agent 302, and domain 1 with node agent 316 and all web applications. Since resource control for web applications is provided by the request router and flow controller, such collocation of web applications does not affect the ability to provide performance isolation for them. Domains for jobs are created and started on demand.

Referring now to FIG. 5, a flow diagram illustrates a management methodology for a system of heterogeneous workloads in the system architecture of FIG. 2, according to an embodiment of the present invention. The methodology begins in block 502 where a web application placement is calculated that maximizes the minimum utility across all web applications. In block 504, web applications are placed on nodes by a placement controller driven by utility functions of allocated CPU demand. In block 506, web application requests are received at a request router. In block 508, requests are classified into flows at the request router depending on target web application and service class. In block 510, classified requests are placed into per-flow queues within the request router. In block 512, the web application requests are dispatched from the request router to web applications on the nodes in accordance with a weighted-fair scheduling discipline that observes a concurrency limit of the system. In block 514, a model of the system is built that allows a flow controller to predict the performance of a flow for any choice of concurrency limit and weights. In block 516, the weights of the weighted-fair scheduling discipline and the concurrency limit of the system are dynamically adjusted in response to workload intensity and system configuration.
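Blocks 508 through 512 can be illustrated with a small sketch of a router that keeps per-flow queues and dispatches under a concurrency limit. The class `RequestRouter`, the flow names, and the particular weighted-fair rule (pick the backlogged flow with the smallest ratio of dispatched requests to weight) are assumptions made for illustration; the text does not specify which weighted-fair discipline is used.

```python
from collections import deque


class RequestRouter:
    """Hypothetical router with per-flow queues and a concurrency limit."""

    def __init__(self, weights, concurrency_limit):
        self.weights = dict(weights)          # flow -> weight
        self.limit = concurrency_limit        # max requests in service
        self.queues = {flow: deque() for flow in weights}
        self.served = {flow: 0 for flow in weights}
        self.in_flight = 0

    def enqueue(self, flow, request):
        # Requests are assumed to be classified into a flow already.
        self.queues[flow].append(request)

    def dispatch(self):
        """Dispatch queued requests until the concurrency limit is reached."""
        dispatched = []
        while self.in_flight < self.limit:
            backlogged = [f for f, q in self.queues.items() if q]
            if not backlogged:
                break
            # Weighted-fair choice: the flow that is furthest behind its share.
            flow = min(backlogged,
                       key=lambda f: (self.served[f] + 1) / self.weights[f])
            dispatched.append((flow, self.queues[flow].popleft()))
            self.served[flow] += 1
            self.in_flight += 1
        return dispatched

    def complete(self):
        # Called when a web application finishes serving a request.
        self.in_flight -= 1


router = RequestRouter({"gold": 2, "silver": 1}, concurrency_limit=3)
for i in range(3):
    router.enqueue("gold", f"g{i}")
    router.enqueue("silver", f"s{i}")
print(router.dispatch())
```

In this sketch a flow controller would correspond to code that rewrites `router.weights` and `router.limit` over time, as in block 516.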

In block 518, dependencies among jobs are managed at a job scheduler. In block 520, jobs are matched with nodes. In block 522, jobs are received from the job scheduler at a job scheduler proxy of the placement controller. In block 524, requirements for jobs for a given value of utility function are estimated at a job utility estimator of the placement controller. In block 526, job placements are calculated that maximize the minimum utility across all web applications at a placement optimizer of the placement controller. In block 528, the job placements are communicated to the job scheduler through the job scheduler proxy. In block 530, jobs are allocated to the nodes in accordance with placement decisions communicated to a job scheduler by the placement controller.
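The interplay between estimating requirements for a given utility value (block 524) and maximizing the minimum utility (block 526) can be sketched with a bisection over the utility level. This is only an illustration under strong assumptions: each workload is represented by a hypothetical, non-decreasing function that maps a utility value to the CPU it needs, only aggregate CPU capacity is checked, and the assignment of workloads to individual nodes is ignored.

```python
def max_min_utility(demand_functions, capacity, low=0.0, high=1.0, steps=50):
    """Largest utility level that every workload can reach simultaneously.

    demand_functions -- callables u -> CPU needed to achieve utility u
                        (assumed non-decreasing in u)
    capacity         -- total CPU available
    The level `low` is assumed to be feasible.
    """
    for _ in range(steps):
        mid = (low + high) / 2
        needed = sum(demand(mid) for demand in demand_functions)
        if needed <= capacity:
            low = mid    # level is achievable, try a higher one
        else:
            high = mid   # level needs too much CPU, back off
    return low


# Two hypothetical workloads with linear CPU requirements, 3 CPUs in total.
workloads = [lambda u: 2 * u, lambda u: 4 * u]
print(max_min_utility(workloads, capacity=3))
```

Giving every workload the CPU it needs at the returned level equalizes utilities, which is what maximizing the minimum amounts to when all requirement functions are increasing.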

In block 532, profiles for web application requests are obtained in the form of an average number of CPU cycles consumed by requests of a given flow. In block 534, profiles for jobs are obtained in the form of CPU cycles required to complete the job, the number of threads used by the job, or the maximum CPU speed at which the job may progress. Finally, in block 536, the profiles for web application requests are communicated to the placement controller and the flow controller, and the profiles for jobs are communicated to the job scheduler.

Provided below are examples of experimental evaluations, implementations, and integrations; additional implementations for embodiments of the present invention are also possible.

An approach of the present invention is experimentally evaluated using both real system measurements and a simulation. The system in an embodiment of the present invention has been implemented and integrated with WebSphere Extended Deployment application server middleware. WebSphere Extended Deployment is used to provide flow control for web applications, and Xen virtual machines are used to provide performance isolation for non-interactive workloads.

In the experiments, a single micro-benchmark web application is used that performs some CPU-intensive calculation interleaved with sleep times, which simulate backend database access or I/O operations. A set of non-interactive applications is also used, which consists of well-known CPU-intensive benchmarks. In particular, BLAST, from the National Center for Biotechnology Information (NCBI), Lucene, from the Apache Software Foundation, ImageMagick, and POV-Ray, from Persistence of Vision Pty. Ltd., are used as representative applications for bioinformatics, document indexing, image processing, and 3D rendering scenarios, respectively. BLAST (Basic Local Alignment Search Tool) is a set of similarity search programs designed to explore all of the available sequence databases for protein or DNA queries. Apache Lucene is a high-performance, full-featured, open-source text search engine library written entirely in Java. In the experiments, the example indexing application provided with the Lucene library has been run to index a large set of files previously deployed in the file system. POV-Ray (Persistence of Vision Raytracer) is a high-quality free tool for creating three-dimensional graphics. ImageMagick is a software suite to create, edit, and compose bitmap images.

Referring now to FIG. 6, a table illustrates jobs used in experiments, according to an embodiment of the present invention. In the experiments, six different jobs are submitted, whose properties are shown in FIG. 6. Differentiation of execution time is achieved by choosing different parameters, or by batching multiple invocations of the same application. All applications used except BLAST are single-threaded; hence they can only use one CPU. In addition, Lucene is I/O intensive; hence it cannot utilize the full speed of a CPU. Jobs are assigned to three service classes. The completion time goal for each job is defined relative to its profiled execution time, with a factor of 1.5, 3, and 10 for the platinum, gold, and silver class, respectively.
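The relative completion time goals can be written as a one-line calculation. The factors come from the text above; the function name and the 10-minute example job are hypothetical and are not taken from FIG. 6.

```python
# Completion time goal as a multiple of the profiled execution time.
GOAL_FACTOR = {"platinum": 1.5, "gold": 3.0, "silver": 10.0}


def completion_time_goal(profiled_minutes, service_class):
    """Return the completion time goal in minutes for a job."""
    return GOAL_FACTOR[service_class] * profiled_minutes


# A hypothetical job profiled at 10 minutes of execution time.
for service_class in GOAL_FACTOR:
    print(service_class, completion_time_goal(10, service_class))
```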

The system in an embodiment of the present invention is evaluated on a cluster of two physical machines, xd018 and xd020, each with two 3 GHz CPUs and 2 GB of memory. The XenSource-provided Xen 3.0.2 packages for Red Hat Enterprise Linux 4 are used.

While testing, it is determined that the resource control actions of the version of Xen used are rather brittle and cause various internal failures across the entire Xen machine. Therefore, in the experiments, resource control actions in the machine agent code are suppressed.

The effectiveness of automation mechanisms used by the system in an embodiment of the present invention may be studied. Three different jobs are taken from the set, JOB1, JOB2, and JOB5, and various automation actions are performed on them while their duration is measured. Migration is not measured because it is not set up in the system in an embodiment of the present invention. Instead, move-and-restore is used. Clearly, this is quite an inefficient process, mostly due to the overhead of copying the image. A dramatically different result is expected once live migration is put in place.

Referring now to FIG. 7, a table illustrates runtime of VM operations for various contained jobs, according to an embodiment of the present invention. The domain creation time includes the time taken to create the domain metadata, such as configuration files. Process creation involves copying process files into the process target domain while the domain is in the running state. Suspend and restore operations involve creating a checkpoint of domain memory and saving it to disk, and restoring domain memory from a checkpoint on disk, respectively. The checkpoint copy operation involves transferring the checkpoint file between machines in the same LAN. The checkpoint file is practically equal in size to the amount of RAM allocated to a domain. Similarly, the time to copy an image is measured between two machines in the LAN. There is a clear relationship between domain RAM size and its checkpoint copy time, and between domain image size and image copy time. Both copy image and copy checkpoint can be avoided when shared storage is available. Migration time includes suspend, resume, copy image, and copy checkpoint, and could be greatly reduced with the use of shared storage.
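The composition of move-and-restore time described above can be sketched as a simple cost model. The model assumes that copy time is proportional to file size at a fixed LAN bandwidth, which is one plausible reading of the "clear relationship" noted above rather than a measured law; the function name and all numbers in the example are hypothetical and are not values from FIG. 7.

```python
def migration_time_s(ram_mb, image_mb, suspend_s, restore_s,
                     lan_mb_per_s, shared_storage=False):
    """Estimate move-and-restore time in seconds.

    The checkpoint file is taken to be as large as the domain's RAM.
    With shared storage, neither the checkpoint nor the image is copied.
    """
    if shared_storage:
        copy_s = 0.0
    else:
        copy_s = (ram_mb + image_mb) / lan_mb_per_s
    return suspend_s + copy_s + restore_s


# Hypothetical domain: 1000 MB RAM, 2000 MB image, 100 MB/s LAN.
print(migration_time_s(1000, 2000, 10, 10, 100))
print(migration_time_s(1000, 2000, 10, 10, 100, shared_storage=True))
```

Under these made-up numbers the copies account for most of the total, which mirrors the observation that shared storage would greatly reduce migration time.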

An experiment to demonstrate the benefits of using server virtualization technology in the management of heterogeneous workloads is described. StockTrade (a web-based transactional test application) is deployed in domain 1 on two machines, xd018 and xd020. Load to StockTrade is varied using a workload generator that allows for control of the number of client sessions that reach an application. Initially, 55 sessions are started, and with this load it is observed that the response time of StockTrade requests is about 380 ms and approaches the response time goal of 500 ms. Referring now to FIG. 8, a diagram illustrates response time for StockTrade and job placement on nodes, according to an embodiment of the present invention. At this load intensity, StockTrade consumes about ⅚ of the CPU power available on both machines. Then JOB5 (A) is submitted. Recall from FIG. 6 that JOB5 is associated with the platinum service class and therefore has a completion time goal equal to 1.5 times its expected execution time. After a delay caused by the duration of the placement control cycle (B) and the domain starting time, JOB5 is started (C) in domain 2 on xd020 and, in the absence of any resource control mechanism, is allocated the entire requested CPU speed, which is equivalent to 0.6 CPU. As a result of the decreased CPU power allocation to domain 1 on xd020, the response time for StockTrade increases to 480 ms, but it stays below the goal. A few minutes after submitting JOB5, JOB1 (D) is submitted, whose service class is bronze. JOB1 has a very relaxed completion time goal but it is very CPU demanding. Starting it now would take 2 CPUs from the current StockTrade allocation.

At 800 s since the beginning of the experiment, load to StockTrade is reduced to 25 concurrent client sessions. When the CPU usage of StockTrade drops to about 50% of each machine, the placement controller decides (E) to start JOB1 (F) on xd018. After 1000 s, the number of client sessions is increased back to 55, and the placement controller suspends JOB1 (G). Typically, JOB1 will later be resumed when any of the following conditions occurs: (1) JOB5 completes, (2) load to StockTrade is reduced, or (3) JOB1 gets close enough to its target completion time so as to necessitate its resumption, even at the expense of worsened performance for StockTrade. However, the occurrence of the third condition indicates that the system in an embodiment of the present invention is under-provisioned; hence an SLA violation may not be avoided. This simple experiment demonstrates that with the use of server virtualization, the system in an embodiment of the present invention is able to balance resource usage between web and non-interactive workloads.

The usefulness of server virtualization technology is also shown in the management of homogeneous, in this case non-interactive, workloads. Using the same experimental set-up as in the preceding experiment, a test case is run that involves only the long-running jobs shown in FIG. 6. Referring now to FIG. 9, a diagram illustrates node utilization by long-running jobs, according to an embodiment of the present invention.

The test case is started by submitting JOB1 (A), which is started on xd020 and takes its entire CPU power. Soon after JOB1 is submitted, JOB2 and JOB3 (B) are submitted, which both get started on xd018, and each of them is allocated one CPU on the machine. Ten minutes later, JOB4 (C) is submitted, which has a very strict completion time requirement. In order to meet this requirement, APC decides to suspend JOB1 and start JOB4 in its place. Note that if JOB1 were allowed to complete before JOB4 is allowed to start, JOB4 would wait 5 min in the queue; hence it would complete no earlier than 13 min after its submission time, which would exceed its goal. Instead, JOB4 is started as soon as it arrives and completes within 10 min, which is within its goal. While JOB4 is running, JOB5 (D) is submitted. However, JOB5 belongs to a lower class than any job currently running, and therefore is placed in the queue. When JOB4 completes, JOB5 is started on xd020. Since JOB5 consumes only 1 CPU, APC also resumes JOB1 and allocates it the remaining CPU. However, to avoid Xen stability problems in the presence of resource control mechanisms, the resource control action is suppressed. As a result, resolving competition for CPUs is delegated to the Xen hypervisor.

In the next phase of the experiment, the use of migration is demonstrated. After the completion of JOB1 and JOB3, JOB6 (E) is submitted. When JOB6 arrives, JOB2 and JOB5 each consume 1 CPU on xd018 and xd020, respectively. Since JOB6 requires 2 CPUs, APC may either (1) make it wait in the queue, (2) suspend JOB2 or JOB5, (3) collocate and resource control JOB6 with either JOB2 or JOB5, or (4) migrate either JOB2 or JOB5. Options (1)-(3) would result in wasted capacity on one or both machines. Moreover, options (1) and (3) would result in having a platinum class job receive proportionately less CPU power than JOB5, whose service class is gold. This would clearly not be the optimal decision from the perspective of the optimization objective. Hence, APC decides (E) to move JOB4 to xd018 (which it will now share with JOB5) and start JOB6 on the now-empty xd020.

Even though this experiment shows that APC correctly uses migration when machine fragmentation makes it difficult to place new jobs, it also demonstrates a limitation of the optimization technique, which is currently oblivious to the cost of performing automation actions. Although in this experiment 15 min is an acceptable price to pay for migrating a job, it is easy to imagine a scenario where performing such a costly migration would have a damaging effect.

Potential benefits of using virtualization technology in the management of non-interactive workloads are studied. A system is simulated in which jobs with characteristics similar to the ones in FIG. 6 are submitted randomly with exponentially distributed inter-arrival times. The workload mix includes 25% multithreaded jobs with execution time of 32 min, 25% multithreaded jobs with execution time of 23 min, 25% single-threaded jobs with execution time of 66 min, 15% single-threaded jobs with execution time of 45 min, and 10% single-threaded jobs with execution time of 127 min. The service class distribution for all jobs is 50%, 40%, and 10% for platinum, gold, and silver service class, respectively. Mean inter-arrival time is varied between 8 and 30 min.
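A workload of the kind described above can be generated with a few lines of code. The job mix, class distribution, and exponential inter-arrival times follow the text; the function name `generate_jobs`, the dictionary layout, and the choice to draw the job type and the service class independently are assumptions of this sketch, not details of the simulator actually used.

```python
import random

# (threading model, execution time in minutes, share of the mix)
JOB_MIX = [
    ("multithreaded", 32, 0.25),
    ("multithreaded", 23, 0.25),
    ("single-threaded", 66, 0.25),
    ("single-threaded", 45, 0.15),
    ("single-threaded", 127, 0.10),
]
SERVICE_CLASSES = [("platinum", 0.50), ("gold", 0.40), ("silver", 0.10)]


def generate_jobs(count, mean_interarrival_min, seed=0):
    """Return a list of job descriptions with increasing arrival times."""
    rng = random.Random(seed)
    clock = 0.0
    jobs = []
    for _ in range(count):
        # expovariate takes the rate, i.e. 1 / mean inter-arrival time.
        clock += rng.expovariate(1.0 / mean_interarrival_min)
        kind, minutes, _ = rng.choices(
            JOB_MIX, weights=[share for _, _, share in JOB_MIX])[0]
        service_class = rng.choices(
            [name for name, _ in SERVICE_CLASSES],
            weights=[share for _, share in SERVICE_CLASSES])[0]
        jobs.append({"arrival_min": clock, "kind": kind,
                     "execution_min": minutes, "class": service_class})
    return jobs


for job in generate_jobs(3, mean_interarrival_min=8):
    print(job)
```

Sweeping `mean_interarrival_min` from 8 to 30 would reproduce the range of load levels mentioned above, from heavily loaded to lightly loaded.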

The simulation does not model the cost of performing virtualization actions. Hence, the results concern the theoretical bound on the particular algorithmic technique used.

The placement algorithm (APC) is compared with well-known scheduling techniques: first-come-first-served (FCFS) and earliest-deadline-first (EDF), in which the completion time goal is interpreted as a deadline. The placement technique is also executed after disabling the automation mechanisms provided by virtualization technology (APC_NO_KNOBS).
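The two baseline disciplines can be contrasted with a small sketch that counts missed deadlines on a single CPU. The simulator below is a deliberately simplified, non-preemptive model with hypothetical job data; it is not the simulation used for the reported results and does not model APC itself.

```python
def count_missed(jobs, priority):
    """Run jobs one at a time on a single CPU and count missed deadlines.

    jobs     -- list of dicts with name, arrival, duration, deadline
    priority -- key function; the ready job with the smallest key runs next
    """
    pending = sorted(jobs, key=lambda j: j["arrival"])
    clock, missed = 0.0, 0
    while pending:
        ready = [j for j in pending if j["arrival"] <= clock]
        if not ready:
            clock = pending[0]["arrival"]  # idle until the next arrival
            continue
        job = min(ready, key=priority)
        pending.remove(job)
        clock += job["duration"]           # run to completion
        if clock > job["deadline"]:
            missed += 1
    return missed


def fcfs(job):
    return (job["arrival"], job["name"])


def edf(job):
    return job["deadline"]


jobs = [
    {"name": "A", "arrival": 0, "duration": 10, "deadline": 100},
    {"name": "B", "arrival": 0, "duration": 5, "deadline": 6},
]
print("FCFS missed:", count_missed(jobs, fcfs))
print("EDF missed:", count_missed(jobs, edf))
```

In this example FCFS runs the long job A first and causes B to miss its deadline, while EDF runs B first and meets both.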

Referring now to FIG. 10, a diagram illustrates a percentage of jobs that have not met their completion time goal as a function of inter-arrival time, according to an embodiment of the present invention. When APC uses virtualization mechanisms, it performs much better than FCFS and EDF. Throughout the experiment, it does not violate any SLAs, with the exception of a high-overload case corresponding to a job inter-arrival time of 8 min. In the overloaded system, the technique has a 20-30% lower number of missed targets than FCFS and EDF, which is not shown in FIG. 10. When virtualization mechanisms are not used, the algorithm is no better or worse than EDF. This shows that the improvement observed in the case of APC is truly due to the use of virtualization technology and not due to a new clever scheduling technique.

Referring now to FIG. 11, a diagram illustrates a number of suspend operations, according to an embodiment of the present invention. Referring to FIG. 12, a diagram illustrates a sum of migrations and move-and-restore actions, according to an embodiment of the present invention. Not surprisingly, as load increases, the number of actions also increases. With very high load, each job is suspended and moved more than once, which in practice will increase its execution time. In order to benefit from the usage of the automation mechanism in practice, it is therefore important to consider the cost of automation mechanisms in the optimization problem solved by APC. Such costs may be considered.

Referring now to FIG. 13, a block diagram illustrates an illustrative hardware implementation of a computing system in accordance with which one or more components/methodologies of the invention (e.g., components/methodologies described in the context of FIGS. 1-5) may be implemented, according to an embodiment of the present invention.

As shown, the computer system may be implemented in accordance with a processor 1310, a memory 1312, I/O devices 1314, and a network interface 1316, coupled via a computer bus 1318 or alternate connection arrangement.

It is to be appreciated that the term “processor” as used herein is intended to include any processing device, such as, for example, one that includes a CPU (central processing unit) and/or other processing circuitry. It is also to be understood that the term “processor” may refer to more than one processing device and that various elements associated with a processing device may be shared by other processing devices.

The term “memory” as used herein is intended to include memory associated with a processor or CPU, such as, for example, RAM, ROM, a fixed memory device (e.g., hard drive), a removable memory device (e.g., diskette), flash memory, etc.

In addition, the phrase “input/output devices” or “I/O devices” as used herein is intended to include, for example, one or more input devices for entering speech or text into the processing unit, and/or one or more output devices for outputting speech associated with the processing unit. The user input speech and the speech-to-speech translation system output speech may be provided in accordance with one or more of the I/O devices.

Still further, the phrase “network interface” as used herein is intended to include, for example, one or more transceivers to permit the computer system to communicate with another computer system via an appropriate communications protocol.

Software components including instructions or code for performing the methodologies described herein may be stored in one or more of the associated memory devices (e.g., ROM, fixed or removable memory) and, when ready to be utilized, loaded in part or in whole (e.g., into RAM) and executed by a CPU.

While previous techniques concentrate on managing virtual machines as primary abstractions that are exposed to the end user, the present technique manages applications using automation mechanisms provided by virtual servers. An application-centric approach is taken, and the usage of VM technology is kept as invisible to the end user as possible.

The system allows management of heterogeneous workloads on a set of heterogeneous server machines using automation mechanisms provided by server virtualization technologies. The system introduces several novel features. First, it allows heterogeneous workloads to be collocated on any server machine, thus reducing the granularity of resource allocation. Second, the approach uses high-level performance goals (as opposed to lower-level resource requirements) to drive resource allocation. Third, the technique exploits a range of new automation mechanisms that will also benefit a system with a homogeneous, particularly non-interactive, workload by allowing more effective scheduling of jobs.

Although illustrative embodiments of the present invention have been described herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be made by one skilled in the art without departing from the scope or spirit of the invention.

1. A system for management of heterogeneous workloads comprising: a plurality of heterogeneous server machines; a placement controller, driven by utility functions of allocated computer processing unit (CPU) demand, that places web applications on one or more of the plurality of heterogeneous server machines; a request router that receives and dispatches requests to one or more web applications on one or more of the plurality of heterogeneous server machines in accordance with a scheduling mechanism; a flow controller in communication with the request router and the placement controller that dynamically adjusts the scheduling mechanism in response to at least one of workload intensity and system configuration; and a job scheduler that allocates jobs to one or more of the plurality of heterogeneous server machines in accordance with placement decisions communicated to the job scheduler by the placement controller.
2. The system of claim 1, wherein the web applications placed on one or more of the plurality of heterogeneous server machines comprise application server clusters.
3. The system of claim 1, wherein the placement controller comprises: a job scheduler proxy that receives jobs from the job scheduler and communicates job placement decisions to the job scheduler; a job utility estimator that estimates requirements for jobs for a given value of utility function; a placement optimizer that calculates job and web application placement that maximizes the minimum utility across all web applications; and a web application executor that places web applications on one or more of the plurality of heterogeneous server machines.
4. The system of claim 1, wherein the request router classifies requests into flows depending on at least one of target web applications and service class.
5. The system of claim 4, wherein the request router places classified requests into per-flow queues within the request router.
6. The system of claim 5, wherein the request router dispatches requests from the per-flow queues based on a weighted-fair scheduling discipline that observes a concurrency limit of the system.
7. The system of claim 6, wherein the flow controller adjusts at least one of weights of the weighted-fair scheduling discipline and the concurrency limit of the system in response to at least one of changing workload intensity and system configuration.
8. The system of claim 7, wherein the flow controller builds a model of the system that allows it to predict the performance of a flow for any choice of concurrency limit and weights.
9. The system of claim 1, wherein the flow controller comprises a utility function calculator in communication with the placement controller.
10. The system of claim 1, wherein the job scheduler manages dependencies among jobs and performs resource matching of the jobs with one or more of the plurality of heterogeneous server machines.
11. The system of claim 1, further comprising: a web application workload profiler that obtains profiles for web application requests in the form of an average number of CPU cycles consumed by requests of a given flow, and provides the profiles for web application requests to the flow controller and placement controller; and a job workload profiler that obtains profiles for jobs in the form of at least one of the number of CPU cycles required to complete the job, the number of threads used by the job, and the maximum CPU speed at which the job may progress, and provides the profiles for jobs to the job scheduler.
12. The system of claim 1, wherein each of the plurality of heterogeneous server machines comprises a plurality of domains.
13. The system of claim 1, wherein a first domain of the plurality of domains comprises a node agent in communication with the placement controller and the job scheduler.
14. The system of claim 13, wherein a second domain of the plurality of domains comprises a machine agent in communication with the node agent that manages virtual machines inside a given heterogeneous server machine.
15. The system of claim 14, wherein the node agent provides job management functionality within the heterogeneous server machine through interaction with the machine agent.
16. The system of claim 14, wherein the machine agent is capable of at least one of creating and configuring a virtual machine image for a new domain, copying files from the second domain to another domain, starting a process in another domain, and controlling the mapping of physical resources to virtual resources.